Explore Spotify's Song Data and Musical Features

Our enriched master table provides fertile ground for exploratory data analysis as we seek to understand the playlist groupings and song features for all songs in the provided dataset.

Out[6]:
Index(['artist_name', 'artist_uri', 'track_name', 'album_uri', 'duration_ms',
       'album_name', 'count', 'track_uri', 'danceability', 'energy', 'key',
       'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness',
       'liveness', 'valence', 'tempo', 'time_signature', 'artist_genres',
       'artist_popularity', 'album_genres', 'album_popularity',
       'album_release_date'],
      dtype='object')

Playlist Distributions

We begin by seeking to understand how many songs appear in each playlist as well as how many playlists each song appears in. The distribution of playlists per song has a very strong right skew: a few songs appear in an enormously large number of playlists, while the vast majority of songs appear in very few.

The median number of songs in a playlist is about 20, though this too has a long right tail. The maximum number of songs in a single playlist in our dataset is 341.

Largest playlist: 341
Smallest playlist: 3
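
The two distributions described above can be sketched as follows. This is a minimal illustration, not the notebook's original code: it assumes a hypothetical long-format table named playlist_tracks with one row per (playlist, track) pair, and the column name playlist_id is an assumption, as is the toy data.

```python
import pandas as pd

# Hypothetical long-format table: one row per (playlist, track) pair.
# The name playlist_tracks and the column playlist_id are assumptions
# made for illustration; the values are made up.
playlist_tracks = pd.DataFrame({
    "playlist_id": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "track_uri": ["a", "b", "c", "a", "d", "a", "b", "e", "f"],
})

# Songs per playlist: number of distinct tracks in each playlist.
songs_per_playlist = playlist_tracks.groupby("playlist_id")["track_uri"].nunique()

# Playlists per song: number of distinct playlists containing each track.
playlists_per_song = playlist_tracks.groupby("track_uri")["playlist_id"].nunique()

print("Largest playlist:", songs_per_playlist.max())
print("Smallest playlist:", songs_per_playlist.min())
print("Median playlist size:", songs_per_playlist.median())

# A positive skewness value indicates a long right tail.
print("Skew of playlists per song:", playlists_per_song.skew())
```

Reporting the skewness alongside the median is one way to put a number on the long right tail that the histograms show visually.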

For our newly enriched musical features, we use Seaborn distribution plots to understand the nature of each column. Some features have a relatively even distribution, but more of them appear to be centered around a small value. As expected, song duration centers around 3-4 minutes, though a few songs are extremely long. We presume these belong to playlists related to sleep, where frequent track changes are detrimental to a relaxing environment. Danceability, a critical feature in these authors' opinion, centers around a score of 0.6, while energy has a left skew with a center around 0.8-0.9. Liveness, instrumentalness, speechiness and loudness all have fairly centered values, with a few songs exhibiting different behavior. The overall clustering of song features around a few values is an interesting takeaway.

Time elapsed: 3.9 seconds

Pairwise associations

We next explore whether any features are correlated with one another. Most pairs appear to be unrelated, with the points filling the entire square of their panel on the pairplot. A few pairs show potential relationships, which we explore in detail next.

Time elapsed: 275.631 seconds
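
A correlation matrix is a cheap numeric complement to a pairplot. The sketch below is an illustration under stated assumptions, not the notebook's code: the data is synthetic, with loudness deliberately constructed to depend on energy, and the variable names toy and pairs are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for three feature columns (made-up data).
# Loudness is built to rise with energy; valence is independent noise.
rng = np.random.default_rng(0)
n = 1000
energy = rng.uniform(0, 1, n)
toy = pd.DataFrame({
    "energy": energy,
    "loudness": -30 + 25 * energy + rng.normal(0, 3, n),
    "valence": rng.uniform(0, 1, n),
})

# Pearson correlation between every pair of columns.
corr = toy.corr(method="pearson")

# Keep only the upper triangle so each pair appears once,
# then rank the pairs by absolute correlation.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(key=np.abs, ascending=False)
print(pairs)
```

Pearson correlation only captures linear association, so a pair that ranks low here can still show structure on the pairplot.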

Energy and loudness have the most visibly linear relationship: as energy increases, so does loudness. We have affectionately labeled this the "Halley's Comet of Music".

Time elapsed: 561.528 seconds

Tempo and danceability, the two features with the most nearly normal distributions, appear to mirror each other as well, with peaks at a tempo of 125 and a danceability between 0.6 and 0.8.

Time elapsed: 817.11 seconds

Charting that same relationship between tempo and danceability on a scatterplot highlights that peak concentration even better. The "Tempo Bump" occurs at about 125 beats per minute (BPM), with a clear jump in danceability.

Time elapsed: 2.337 seconds

Are certain song features associated with playlist inclusion?

We next turned to whether song features had any relationship to playlist inclusion. Are certain features more predictive of the number of playlists a song is likely to be included in?

We started with danceability and saw the hint of a linear relationship: as a song's danceability increases, so does the number of playlists that include it. There remains a lot of noise, though, from the songs we saw earlier in our inclusion distributions that are simply not popular and were added to only a few playlists.

Time elapsed: 1.953 seconds

Loudness has a clear spike at about -8 dB, with fairly rigid limits above and below it. All songs softer than -20 dB are included in very few playlists. The same holds true for songs louder than about 0 dB.

Time elapsed: 1.95 seconds

Energy appears to have very little impact on playlist inclusion, with songs at every level of energy included in a large number of playlists. Playlist themes vary from slower classical music to upbeat electronic, so this matches our expectations.

Time elapsed: 1.973 seconds

As expected, artist_popularity, as reported by Spotify, is positively related to the number of playlists in which an artist's songs are included.

Time elapsed: 1.954 seconds

That expected linear relationship does not hold for album_popularity, though. Instead, playlist inclusion is clearly concentrated among both very popular albums and very unpopular albums. We posit that this is due to the phenomenon of one-hit wonders, whereby an album might be panned by critics and music fans overall but was still listened to and reviewed because it had one or two very popular songs. Those individual songs will be included in playlists, thanks to the flexibility of streaming services, even though the album overall remains unpopular. It is the middle range of albums, those that are just OK and have no hit songs, that are most at risk of not being included in playlists.

Time elapsed: 5.503 seconds
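
One way to check a non-linear pattern like this without relying on a scatterplot is to bin the popularity score and summarize playlist inclusion per bin. The sketch below is only an illustration: the numbers are made up to mimic the described shape, the bin edges are an arbitrary choice, and it assumes the count column holds the number of playlists a song appears in.

```python
import pandas as pd

# Made-up values shaped like the pattern described above:
# high inclusion at both ends of album_popularity, low in the middle.
toy = pd.DataFrame({
    "album_popularity": [5, 12, 18, 35, 42, 55, 61, 78, 85, 93],
    "count": [40, 55, 30, 4, 6, 3, 5, 60, 80, 120],
})

# Five equal-width popularity bins (an arbitrary choice for illustration).
bins = [0, 20, 40, 60, 80, 100]
toy["popularity_bin"] = pd.cut(toy["album_popularity"], bins=bins, include_lowest=True)

# The median is less sensitive than the mean to a handful of hit songs.
summary = toy.groupby("popularity_bin", observed=True)["count"].agg(["median", "mean", "size"])
print(summary)
```

The same binning approach applies to any of the other features, such as loudness or duration_ms, by swapping the column and the bin edges.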

Song duration (duration_ms), like loudness, has a very narrow band of values that users consider acceptable for playlist inclusion. That band sits at about 3.5 minutes, with songs longer than about 4 minutes appearing in an extremely low number of playlists.

Time elapsed: 1.935 seconds

The last musical feature we explored was tempo, which we broke out into a more detailed histogram. Our hypothesis was that there would exist several popular tempos for different genres of songs, something we saw hinted at in our analysis of energy. This proved true, with peaks occurring at 80, 100, 120, 128, 139 and 170 BPM, which we believe map to different genres of music.
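
Peaks like these can also be read off a histogram programmatically. The following is a minimal sketch using a simple local-maximum rule, not the method used in the notebook: the tempo values are made up, and the 1-BPM bin width and the min_count threshold are assumptions chosen for illustration.

```python
import numpy as np

# Made-up tempo data: two clusters, near 100 and 128 BPM.
tempo_values = [98, 99, 100, 101, 102, 126, 127, 128, 129, 130]
tempo_counts = [5, 10, 30, 10, 5, 5, 20, 50, 20, 5]
tempo = np.repeat(np.array(tempo_values, dtype=float), tempo_counts)

# Histogram with 1-BPM-wide bins from 60 to 200 BPM.
edges = np.arange(60, 201, 1)
hist, _ = np.histogram(tempo, bins=edges)
left_edges = edges[:-1]

# A bin is a peak if it is strictly higher than both neighbors
# and holds at least min_count songs (hypothetical threshold).
min_count = 10
interior = hist[1:-1]
is_peak = (interior > hist[:-2]) & (interior > hist[2:]) & (interior >= min_count)
peak_tempos = left_edges[1:-1][is_peak]
print(peak_tempos)
```

On real data the histogram is noisier, so wider bins or a smoothing step would likely be needed before a rule this simple gives stable peaks.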

Moving on from feature relationships, we next explored how these features have changed over time for songs released in different years. We derived a new release_year column from the album_release_date column and show a preview below. We filter to all songs released between 1950 and 2017 to ensure sufficient data for our chart and show the distribution of release years currently in Spotify's library below.

song_id
0         1993
159583    1993
271702    1993
445190    1993
626275    1993
Name: release_year, dtype: object
Time elapsed: 10.914 seconds
Time elapsed: 0.411 seconds
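
The derivation of release_year can be sketched as below. This is an assumption about how such a column might be built rather than the notebook's own code: it relies on album_release_date being a string that starts with a four-digit year, which holds for Spotify's year, month and day precision formats, and the toy values are made up.

```python
import pandas as pd

# Made-up release dates at the three precisions Spotify uses
# (day, month, year), all stored as strings.
toy = pd.DataFrame({
    "album_release_date": ["1993-08-24", "1993", "2005-03", "1948-01-01", "2016-11-04"],
})

# The first four characters are the year regardless of precision.
# The result stays a string column, matching the object dtype in the preview.
toy["release_year"] = toy["album_release_date"].str[:4]

# Convert to numbers only for the range filter; bad values become NaN
# and are excluded by between().
years = pd.to_numeric(toy["release_year"], errors="coerce")
filtered = toy[years.between(1950, 2017)]

# Distribution of release years, in chronological order.
print(filtered["release_year"].value_counts().sort_index())
```

Slicing the string avoids date parsing failures on the year-only and year-month values that a strict datetime conversion would have to handle.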

The features are measured on different scales, so we use MinMaxScaler from sklearn to scale them all between 0 and 1 and chart the changes on a single plot. The most dramatic crossover occurs between acousticness and energy, as the former declines severely starting in the 1950s and the latter rises steadily over time. The trend reverses temporarily during the 1980s, but the two diverge again at a slower pace from 1990 to today. Valence, Spotify’s measure of positivity in a song, has also been declining at a slow but steady pace since about 1977.

Time elapsed: 119.136 seconds
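
The scaling step can be sketched as follows. The toy values are made up, and the order of operations (averaging per year first, then scaling the yearly means) is an assumption; the source does not say whether scaling was applied before or after aggregation.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up feature values for three release years.
toy = pd.DataFrame({
    "release_year": [1960, 1960, 1990, 1990, 2015, 2015],
    "acousticness": [0.9, 0.7, 0.4, 0.2, 0.2, 0.1],
    "energy": [0.3, 0.4, 0.6, 0.7, 0.7, 0.8],
    "loudness": [-14.0, -12.0, -9.0, -8.0, -6.0, -5.0],
})
features = ["acousticness", "energy", "loudness"]

# Average each feature per release year.
yearly = toy.groupby("release_year")[features].mean()

# Rescale every column independently to the [0, 1] range so that
# features with different units can share one axis.
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(yearly),
    index=yearly.index,
    columns=features,
)
print(scaled)
```

Because each column is scaled on its own, the resulting lines show the direction and timing of change but not the relative magnitude between features.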

Loudness has been reaching record heights since the 1990s. This may be associated with that decade's spike in the popularity of electronic music, driven by the proliferation of cheap music production technology.

Time elapsed: 17.608 seconds

The most recent trend is in danceability, which has spiked to its highest-ever levels starting in 2010.

Time elapsed: 17.19 seconds

Lastly, we wanted to explore popular musical keys for songs as well as a sample of popular genres. Among the 100 most-included songs, there are clear preferences for key 1 (C♯/D♭) and key 7 (G, or sol) in Spotify's pitch-class numbering. For playlists, sleep is a surprisingly popular genre, likely related to the number of extremely long songs. Unfortunately, beyond that, Spotify's structuring of genre makes analysis difficult. Spotify includes many specific genres for each album without specifying whether one of them is the primary genre. This means that there are potentially more popular genres than sleep, but in a simple histogram analysis their frequency is diluted because each distinct combination of genres is counted separately.

999950 100
Time elapsed: 0.528 seconds
Out[58]:
[]                                                                            175993
['sleep']                                                                       3401
['gospel']                                                                      2128
['classical', 'classical era', 'early romantic era']                            2014
['contemporary country', 'country', 'country road', 'modern country rock']      1954
['classical', 'classical era']                                                  1920
['ccm', 'christian alternative rock', 'christian music', 'worship']             1847
['baroque', 'classical', 'early music', 'german baroque']                       1766
['soundtrack']                                                                  1765
['banda', 'grupera', 'norteno', 'regional mexican']                             1727
Name: artist_genres, dtype: int64
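
One way to work around the combination problem is to count each genre tag individually instead of each distinct list. The sketch below is an illustration under assumptions: it treats artist_genres as either a Python list or its string representation (the helper to_list is hypothetical), and the toy rows are made up.

```python
import ast
import pandas as pd

# Made-up rows; genres stored as the string form of a Python list.
toy = pd.DataFrame({
    "artist_genres": [
        "['sleep']",
        "['classical', 'classical era']",
        "['classical', 'baroque']",
        "[]",
        "['sleep']",
    ],
})

def to_list(value):
    """Return a list of genres whether the cell holds a string or a list."""
    if isinstance(value, str):
        return ast.literal_eval(value)
    return list(value)

genre_lists = toy["artist_genres"].apply(to_list)

# explode() gives one row per genre tag; empty lists become NaN and are dropped.
per_genre = genre_lists.explode().dropna().value_counts()
print(per_genre)
```

This counts how many songs carry each tag, so a song with several genres contributes to several totals. It still does not identify a primary genre, which the data does not provide.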